Abstract: Phylogenetic data arising on two possibly different 
tree topologies might be mixed through several biological mech- 
anisms, including incomplete lineage sorting or horizontal gene 
transfer in the case of different topologies, or simply different 
substitution processes on characters in the case of the same 
topology. Recent work on a 2-state symmetric model of character 
change showed that for 4 taxa such a mixture model has non- 
identifiable parameters, and thus it is theoretically impossible 
to determine the two tree topologies from any amount of data 
under such circumstances. Here the question of identifiability is 
investigated for 2-tree mixtures of the 4-state group-based models, 
which are more relevant to DNA sequence data. Using algebraic 
techniques, we show that the tree parameters are identifiable for 
the JC and K2P models. We also prove that generic substitution 
parameters for the JC mixture models are identifiable, and for 
the K2P and K3P models obtain generic identifiability results 
for mixtures on the same tree. This indicates that the full 
phylogenetic signal remains in such mixtures, and that the 2- 
state symmetric result is thus a misleading guide to the behavior 
of other models. 



I. Introduction 

A basic question concerning any statistical model is whether 
a probability distribution arising from the model uniquely de- 
termines the parameters that produced it. If so, the parameters 
are said to be identifiable. Indeed, parameter identifiability is 
necessary for the consistency of inference. 

In phylogenetics, it is especially important that the tree 
parameter of a model be identifiable, so that evolutionary 
histories can be consistently inferred. For basic models of 
character evolution along a tree, in which all sites behave 
independently and identically, identifiability of both the tree 
and continuous parameters is long-established. However, as 
phylogenetic models grow in complexity, it becomes increas- 
ingly difficult to analyze the models thoroughly enough to be 
certain this property is retained. Indeed, mixture models of 
all sorts present difficulties, though positive results have been 
obtained for models with a small number of classes evolving 
on the same tree [5], and those with scaled Γ-distributed rates 
[1]. However, even for the GTR+Γ+I model, which is currently 
the most commonly used in DNA data analysis, it is yet to be 
proved that trees are identifiable. 

Several recent works, including [25], [34], [24], and [23], 
considered 2-class mixture models in which the two classes 
evolve along possibly different topological trees. Such models 
could describe instances of horizontal transfer of genetic ma- 
terial between taxa, or incomplete lineage sorting in sequences 
composed of several concatenated genes. In particular, Matsen 
and Steel [24] showed that under the binary symmetric model 
of Cavender-Farris-Neyman, a 2-class mixture on a single 4- 
taxon tree can exactly 'mimic' a single class model on a 
different tree. Because of the small size of the state space in 
this model, its group-based structure, and the small size of the 
tree, explicit calculations were possible to fully analyze this 
situation. However, one should be cautious about extrapolating 
from this result to a pessimistic view about identifiability of 
similar phylogenetic mixtures. The mixture of [24] is an 11- 
parameter model producing a probability distribution in a 7- 
dimensional space, so it is certainly overparameterized. While 
this dimension count does not guarantee non-identifiability of 
the tree, it does explain why it might likely arise. 

By either passing to models with larger state spaces, such 
as 4-state models appropriate to DNA, or by considering 
trees relating more taxa, the joint distribution of states at the 
leaves of the tree will be embedded in a larger dimensional 
space. Thus we might hope to avoid overparameterization 
issues through either of these modifications. As the analysis of 
real biological data typically involves both of these changes, 
these are the types of mixture models it is most desirable to 
understand. 

Here we consider 2-class mixtures analogous to those in the 
works above, but for larger trees and/or state spaces. 

We continue to work with group-based models, focusing 
primarily on those for DNA, so that we retain the powerful 
tool of the Fourier/Hadamard coordinate transformation. 

We also make use of computational algebra software to 
perform calculations well beyond what could be done 'by 
hand.' Our results on identifiability are generally quite positive, 
and although these group-based models are still special cases, 
we believe they provide a better guide to the behavior of more 
realistic models than those of [24]. 

This paper is organized as follows. In Section II we introduce 
2-tree mixture models and the identifiability problem 
in the algebraic setting. Background on group-based models 
is covered in Section III, from basic definitions through their 
presentation in terms of Fourier coordinates. 

Section IV deals with identifiability of the tree parameters 
for Jukes-Cantor and Kimura 2-parameter mixture models on 
two trees. The main result, that tree parameters in such mixtures 
on at least 4 taxa are generically identifiable, is Theorem 
10 and its corollary. Even with generic tree identifiability 
proved for 2-tree mixtures, a natural question is whether a 
single-class (unmixed) model can be distinguished from a 2-tree 
mixture. (This is not answered by the previous result, since 
while a single-class model is a special case of a 2-class model, 
it is non-generic.) We investigate this problem in Section V. 

Finally, in Section VI we turn to identifiability of the 
continuous parameters of these models, assuming the tree 
parameters are known. One feature of a part of our analysis 
is the use of computational algebra software to obtain some 
results with very high probability. Although technically these 
remain conjectures, lacking rigorous proof, the conclusions we 
draw from such calculations are highly reliable for theoretical 
reasons. While using calculations this way is familiar to 
applied algebraic geometers, this approach may be new to 
others, so we begin the section by explaining the reasoning 
informally. With this qualification, we establish the generic 
identifiability of continuous parameters for the JC model when 
either $n \ge 5$, or $n = 4$ and the trees are distinct. In the case 
of identical trees with $n \ge 5$ taxa, we give a fully rigorous 
argument for the three group-based models: JC, K2P, and K3P. 
An interesting non-identifiable case arises from the Jukes- 
Cantor mixtures on two identical 4-taxon trees. 

Command files and instructions for verifying all our com- 
putations using the software Singular [16] can be found at 
the supplementary materials website for this paper [3]. We 
include both computations supporting our arguments, and 
those producing our examples. 

We would, of course, prefer to push the work here beyond 
the group-based models, to include those more routinely used 
in current data analysis. It is possible, after all, that the group- 
based models are special enough that identifiability results for 
them do not carry over to more elaborate models. However, our 
current computational and theoretical tools are not sufficient 
for us to address questions for more general models. 

II. Preliminaries 

Consider a phylogenetic model of $k$-state character change 
on $n$-taxon trees (e.g., for $k = 4$, the Jukes-Cantor model). 
We assume the taxa labelling the leaves are identified with 
$[n] = \{1, 2, \dots, n\}$. Then for each leaf-labelled tree $T$, there 
is a parameterization map $\psi_T$ giving the joint distribution of 
states at the leaves of the tree $T$ as functions of continuous 
parameters. With $S_T$ denoting the continuous parameter space 
on $T$, which we assume is some full-dimensional subset of 
$\mathbb{R}^m$ for some $m$, we have 

$\psi_T : S_T \to \Delta^{k^n-1},$ 

where $\Delta^{k^n-1} \subset [0,1]^{k^n}$ is the probability simplex comprised 
of non-negative real vectors summing to 1. 

Given such a model, the associated 2-tree mixture model 
has the following parameterization maps: For every pair of $n$-taxon 
trees $T_1$ and $T_2$ on the same taxa, let 
$S_{T_1,T_2} = S_{T_1} \times S_{T_2} \times [0, 1]$ and 

$\psi_{T_1,T_2} : S_{T_1,T_2} \to \Delta^{k^n-1}$ 

be defined by 

$\psi_{T_1,T_2}(s_1, s_2, \pi) = \pi\,\psi_{T_1}(s_1) + (1 - \pi)\,\psi_{T_2}(s_2).$ 

Here $\pi$ is the mixing parameter, giving the proportion of 
i.i.d. sites that evolve along tree $T_1$. 

We will only consider algebraic models, for which the maps 
$\psi_T$, and hence $\psi_{T_1,T_2}$, are defined by polynomial formulas. 
This is a small restriction, as many models (e.g., standard 
continuous-time models) which are not polynomial can be 
embedded in ones that are (e.g., the general Markov model). 
Algebraic models can be studied from the perspective of 
algebraic geometry [12], after extending $\psi_T$ and $\psi_{T_1,T_2}$ to 
complex polynomial maps, with images in $\mathbb{C}^{k^n}$. We refer to 
$S_T$ and $S_{T_1,T_2}$ as stochastic parameter spaces, to distinguish 
them from the complex parameter spaces of these extensions. 

We denote by $V_T$ the algebraic variety which is the Zariski 
closure of the image of $\psi_T$ in the complex projective space 
$\mathbb{P}^{k^n-1}$. (See [10], [17] for background in algebraic geometry.) 
Then the closure of the image of $\psi_{T_1,T_2}$ is a variety called the 
join of $V_{T_1}$ and $V_{T_2}$, denoted by 

$V_{T_1} * V_{T_2}.$ 

The join can be described geometrically as the smallest variety 
containing all lines intersecting both $V_{T_1}$ and $V_{T_2}$. In the case 
$T_1 = T_2$ the join is called the secant variety of $V_{T_1}$. 

We use $\mathcal{M}_T$, and $\mathcal{M}_{T_1} * \mathcal{M}_{T_2}$, to denote the images of 
the parameterization maps when applied only to the stochastic 
parameter spaces $S_T$ and $S_{T_1,T_2}$. Thus these denote the sets 
of all probability distributions arising from the parameterized 
models, and 

$\mathcal{M}_T \subseteq V_T, \qquad \mathcal{M}_{T_1} * \mathcal{M}_{T_2} \subseteq V_{T_1} * V_{T_2}.$ 

While $\mathcal{M}_T$ and $\mathcal{M}_{T_1} * \mathcal{M}_{T_2}$ are of course the objects of primary 
interest to phylogenetic applications, the larger complex 
varieties $V_T$ and $V_{T_1} * V_{T_2}$ are more amenable to algebraic 
study. 

Another parameterization of a dense subset of $V_{T_1} * V_{T_2}$, 
which we will also use, is 

$\phi_{T_1,T_2} : V_{T_1} \times V_{T_2} \times \mathbb{P}^1 \dashrightarrow V_{T_1} * V_{T_2},$ 

which when restricted to an affine subset simply maps points 
on the two varieties to their convex sum using the third 
coordinate as a weight. (The dashed arrow indicates the map 
is only defined on a dense subset of the stated domain.) If 
$\pi \in \mathbb{C} \subset \mathbb{P}^1$, then 

$\psi_{T_1,T_2}(s_1, s_2, \pi) = \phi_{T_1,T_2}(\psi_{T_1}(s_1), \psi_{T_2}(s_2), \pi). \qquad (1)$ 

Associated to any algebraic variety $V$ is the ideal $\mathcal{I} = I(V)$ 
of polynomials that defines it; namely, a polynomial $f \in \mathcal{I}$ if, 
and only if, for any point $v \in V$, $f(v) = 0$. For a variety 
associated to a phylogenetic model, such polynomials give 
constraints that entries of a distribution of states at the leaves 
of a tree must satisfy if it arises from the given model. First 
introduced in phylogenetics by Cavender and Felsenstein [8] 
and Lake [22], these polynomials are known as phylogenetic 
invariants, and have been studied extensively in many papers, 
including [13], [32], [33], [29], [20], [31], [6], [4]. 

For algebraic models, it is convenient to slightly weaken 
the notion of identifiability to generic identifiability. The 
word 'generic' is used to mean 'except on a proper algebraic 
subvariety' of the parameter space. Although it is sometimes 
possible to be explicit about this subvariety, we usually are 
not, since the key point in interpretation is that the subvariety 
is a closed set of Lebesgue measure 0 inside the larger set. 
Thus regardless of the precise subvariety involved, 'randomly' 
chosen points are generic with probability 1. 

An additional issue for identifiability of 2-tree mixtures 
is class swapping: Interchanging the trees, along with their 
parameters, while replacing the mixing parameter $\pi$ by $1 - \pi$, 
has no effect on the resulting distribution. Thus, a useful notion 
of identifiability must allow for this. 
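
The class-swapping symmetry can be seen in a small numeric sketch. The two probability vectors below are hypothetical stand-ins for the images of the two single-tree parameterization maps; they are not derived from any tree, and the names `mix`, `p1`, `p2` are illustrative only.

```python
from fractions import Fraction

def mix(p1, p2, pi):
    """Return the 2-class mixture pi*p1 + (1 - pi)*p2 of two distributions."""
    return [pi * x + (1 - pi) * y for x, y in zip(p1, p2)]

# Hypothetical leaf-pattern distributions standing in for the two classes.
p1 = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
p2 = [Fraction(1, 10), Fraction(2, 5), Fraction(3, 10), Fraction(1, 5)]
pi = Fraction(1, 3)

original = mix(p1, p2, pi)
swapped = mix(p2, p1, 1 - pi)   # interchange the classes, replace pi by 1 - pi

print(original == swapped)      # the two parameter choices give one distribution
```

Exact rational arithmetic is used so that the equality test is not affected by rounding.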

Definition 1: The tree parameters of the 2-tree mixture 
model are generically identifiable if, for any binary trees $T_1, T_2$ 
on the same set of taxa, and generic choices of $s_1, s_2, \pi$, 

$\psi_{T_1,T_2}(s_1, s_2, \pi) = \psi_{T_1',T_2'}(s_1', s_2', \pi')$ 

implies $\{T_1, T_2\} = \{T_1', T_2'\}$. 

Definition 2: The continuous parameters of a 2-tree mixture 
model on $T_1$ and $T_2$ are generically identifiable if for generic 
choices of $s_1, s_2, \pi$, 

$\psi_{T_1,T_2}(s_1, s_2, \pi) = \psi_{T_1,T_2}(s_1', s_2', \pi')$ 

implies $(s_1, s_2, \pi) = (s_1', s_2', \pi')$, or, in the case where 
$T_1 = T_2$, $(s_1, s_2, \pi) = (s_2', s_1', 1 - \pi')$. 

Let $K \subseteq [n]$ be a subset of the leaf set. For any tree $T$ on 
$n$ leaves, $T|_K$ will denote the induced subtree of $T$ with leaf 
set $K$. Since marginalization onto leaf subsets is a linear map 
that preserves the mixture structure of a phylogenetic model, 
we obtain the following useful fact. 

Lemma 3: Let $T_1, T_2, T_3, T_4$ be $n$-taxon trees, not necessarily 
distinct, and let $K \subseteq [n]$. If $V_{T_1|_K} * V_{T_2|_K} \not\subseteq V_{T_3|_K} * V_{T_4|_K}$, 
then $V_{T_1} * V_{T_2} \not\subseteq V_{T_3} * V_{T_4}$. 

Proof: Marginalization to a fixed set $K$ gives a linear 
map from $\mathbb{C}^{k^n}$ to $\mathbb{C}^{k^{|K|}}$, which sends $V_T$ to $V_{T|_K}$ for any $T$. 
For any linear transformation $A$ and any varieties $V$, $W$, we 
have $A(V * W) = AV * AW$, because the mixture construction 
is a linear operation. Since for any sets $S_1, S_2$, and any map 
$f$, $f(S_1) \not\subseteq f(S_2)$ implies that $S_1 \not\subseteq S_2$, the lemma follows. 
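
The linearity used in this proof, namely that marginalizing a mixture gives the mixture of the marginals, can be illustrated on toy data. The sketch below uses two hypothetical joint distributions on three binary leaves (not produced by any phylogenetic model) and a leaf subset `keep` playing the role of $K$.

```python
from fractions import Fraction
from itertools import product

def marginalize(dist, keep):
    """Marginalize a joint distribution (state tuple -> prob) onto leaves in `keep`."""
    out = {}
    for states, prob in dist.items():
        key = tuple(states[i] for i in keep)
        out[key] = out.get(key, Fraction(0)) + prob
    return out

def mix(d1, d2, pi):
    """Pointwise mixture pi*d1 + (1 - pi)*d2 of two distributions on the same keys."""
    return {s: pi * d1[s] + (1 - pi) * d2[s] for s in d1}

patterns = list(product((0, 1), repeat=3))            # three binary leaves
d1 = {s: Fraction(1, 8) for s in patterns}            # hypothetical class 1
d2 = {s: Fraction(1 + sum(s), 20) for s in patterns}  # hypothetical class 2
pi = Fraction(1, 4)
keep = (0, 2)                                         # the leaf subset K

lhs = marginalize(mix(d1, d2, pi), keep)
rhs = mix(marginalize(d1, keep), marginalize(d2, keep), pi)
print(lhs == rhs)
```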



III. Group-based Phylogenetic Models 

Group-based phylogenetic models will be the main subject 
of study in this paper, so we collect known results about 
these models, including their natural representation in Fourier 
coordinates. 

Throughout we assume that all trees $T$ are binary. We root 
$T$ by picking an arbitrary edge, and introducing a root $\rho$ as 
a distinguished node of degree 2 along it. Thus, every edge 
of $T$ may be considered directed away from the root. To each 
node $v$ in the tree, we associate a discrete random variable 
$X_v$ with $k$ states, and to each directed edge $e$ in the tree, 
we associate a Markov transition matrix $M_e$, describing the 
conditional probabilities of various state changes. We assume 
the reader is familiar with the usual assumptions of Markov 
processes on trees [27]. The joint distribution of states at the 
leaves of $T$ (the image of $\psi_T$) may be computed once the root 
distribution and the collection of $\{M_e\}$ are specified. We refer 
to the entries of the root distribution and the Markov matrices 
as the continuous parameters of the model. 

Definition 4: Let $T$ be a binary tree rooted at $\rho$. Let $G$ be a 
finite abelian group of order $k$, and identify its elements with 
the state space of the random variables $X_v$. Then a group-based 
model on $T$ for the group $G$ is a phylogenetic model 
with a uniform root distribution, and transition probabilities 
on each edge $e$ satisfying 

$M_e(g, h) = f_e(g - h)$ 

for some functions $f_e : G \to \mathbb{R}$. 

Some standard examples of group-based models are the 
binary symmetric model, a.k.a. the Cavender-Farris-Neyman 
(CFN) model, which is associated to the group $\mathbb{Z}_2$; and 
the Jukes-Cantor (JC) model, the Kimura 2-parameter (K2P) 
model, and the Kimura 3-parameter (K3P) model, which are 
associated to the group $\mathbb{Z}_2 \times \mathbb{Z}_2$. With appropriate ordering of 
the state spaces, the transition matrices for these models have 
the following forms, respectively: 
Group-based models are subject to a remarkable linear 
change of coordinates, called the discrete Fourier, or 
Hadamard, transform [18], [19], [13], [32], [33]. After applying 
the Fourier transform the models are seen to be toric 
varieties [31]. In particular, the transformed image coordinates 
are given in terms of transformed domain coordinates by a 
monomial parameterization. 

To make this parameterization explicit, henceforth let $G$ be 
$\mathbb{Z}_2$ or $\mathbb{Z}_2 \times \mathbb{Z}_2$, and $T$ an $n$-taxon tree. The Fourier coordinates 
for a group-based model are denoted $q_{g_1 \dots g_n}$, where $g_i \in G$ 
for all $i$. Let $\Sigma(T)$ be the set of splits induced by the 
edges of $T$. To each split $A|B \in \Sigma(T)$, we associate a set of 
parameters $a^{A|B}_g$, where $g \in G$. The toric parameterization for 
the model is then given by: 

$q_{g_1 \dots g_n} = \begin{cases} \prod_{A|B \in \Sigma(T)} a^{A|B}_{\sum_{i \in A} g_i} & \text{if } \sum_{i=1}^n g_i = 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (2)$ 

Note that by our choices of $G$, when $\sum_{i=1}^n g_i = 0$ we will 
have $\sum_{i \in A} g_i = \sum_{i \in B} g_i$ for any split $A|B$. Thus the formula 
above does not depend on the ordering of the sets in the splits. 

To ease notation, we describe trees by omitting trivial splits 
associated to leaf edges (i.e., those of the form {i} | [n] \ {i}). 
When describing the toric parametrization of a group-based 
model, we abbreviate the parameters associated with the edge 
leading to leaf $i$ by $a^i$. 

For elements in the group G = Z2 x Z2 associated to 
the Jukes-Cantor and Kimura models, we (arbitrarily) identify 
nucleotides with the group elements in the following way: 
A = (0,0), C = (0,1), G = (1,0), and T = (1,1). We 
illustrate these notions with an example on a 5-taxon tree. 

Example 5: Let $T = \{12|345, 123|45\}$. The toric parameterization 
for the K3P model is given by formulas of the form: 

$q_{g_1 g_2 g_3 g_4 g_5} = a^1_{g_1} a^2_{g_2} a^3_{g_3} a^4_{g_4} a^5_{g_5}\, a^{12|345}_{g_1+g_2}\, a^{123|45}_{g_1+g_2+g_3}$ 

where $\sum_{i=1}^5 g_i = 0$, and $q_{g_1 g_2 g_3 g_4 g_5} = 0$ otherwise. For 
example, 

$q_{CCCTG} = a^1_C\, a^2_C\, a^3_C\, a^4_T\, a^5_G\, a^{12|345}_A\, a^{123|45}_C.$ 
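
The bookkeeping in this example can be reproduced mechanically. The following is a minimal sketch, assuming the nucleotide identification A = (0,0), C = (0,1), G = (1,0), T = (1,1) given above; the function name `fourier_monomial` and the encoding of a tree as a list of split sides are illustrative choices, not notation from the paper.

```python
# Nucleotides as elements of Z2 x Z2, following the identification in the text.
GROUP = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
NAME = {v: k for k, v in GROUP.items()}

def add(g, h):
    """Addition in Z2 x Z2."""
    return ((g[0] + h[0]) % 2, (g[1] + h[1]) % 2)

def fourier_monomial(word, splits):
    """Return [(edge label, group element)] for q_word, or None if q_word = 0.

    `splits` lists one side A of each nontrivial split A|B (1-based leaves).
    """
    total = (0, 0)
    for ch in word:
        total = add(total, GROUP[ch])
    if total != (0, 0):
        return None                      # coordinates with nonzero sum vanish
    factors = [(str(i), ch) for i, ch in enumerate(word, start=1)]  # leaf edges
    n = len(word)
    for side in splits:
        g = (0, 0)
        for i in side:
            g = add(g, GROUP[word[i - 1]])
        rest = [i for i in range(1, n + 1) if i not in side]
        label = "".join(map(str, side)) + "|" + "".join(map(str, rest))
        factors.append((label, NAME[g]))
    return factors

tree = [(1, 2), (1, 2, 3)]               # T = {12|345, 123|45}
print(fourier_monomial("CCCTG", tree))
```

For the word CCCTG the two internal edges receive the labels A and C, matching the monomial displayed in Example 5.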

For the JC and K2P models, the Fourier coordinates are 
described by simply imposing additional relationships on the 
continuous parameters. 

Proposition 6: [13], [19] In Fourier coordinates, the K3P 
model on a tree $T$ consists of all the Fourier vectors arising 
from representation (2) so that, for each split $e$, $a^e_A = 1$, and 
$a^e_C, a^e_G, a^e_T \in (0, 1]$. The K2P model is the submodel of the 
K3P model satisfying additionally, that for all splits $e$, $a^e_G = a^e_T$. 
The JC model is the submodel of the K2P model satisfying 
additionally, that for all splits $e$, $a^e_C = a^e_G = a^e_T$. 

Significantly for the work later in this article, the Fourier 
transform is a linear change of coordinates. Thus the operation 
of taking tree mixtures commutes with the Fourier transform, 
which allows us to naturally represent the mixture models we 
consider in Fourier coordinates. Though these mixture models 
are not toric, we still gain insight from this viewpoint. 

To close this section, we make several comments regarding 
some combinatorial aspects of Fourier coordinates for group- 
based models. As linear invariants are crucial to the arguments 
below, we first discuss enumeration of distinct Fourier coor- 
dinates, and computations of the dimension of the space of 
linear invariants for a model. In closing, we illustrate some 
useful combinatorial mnemonics for working with Fourier 
coordinates and identifying invariants. Although these devices 
are likely familiar to experts in group-based models, we hope 
our exposition will be useful to those less familiar with these 
models. 

The zero set of the linear invariants for any variety $V \subseteq \mathbb{P}^m$ 
is the smallest linear subspace of $\mathbb{P}^m$ containing $V$. This 
set is called the span of $V$, Span($V$). The span of a finite 
collection of varieties is defined similarly, as the span of their 
union. Note that, by the join construction, it is immediate that 
Span($V * W$) = Span($V$, $W$). 

For group-based models on an n-taxon tree (that is, an 
unmixed model), the number of distinct Fourier coordinates 
is precisely the dimension of the span of the model, as 
there are no linear relations between distinct monomials. This 
establishes: 

Proposition 7: For the CFN and K3P models, there are no 
non-trivial linear invariants. The number of distinct Fourier 
coordinates is $2^{n-1}$ for the CFN model and $4^{n-1}$ for the K3P 
model. 



As our method of investigation of mixture models in Section 
IV depends upon the existence of linear invariants, we thus 
focus on the JC and K2P group-based models. 

Steel and Fu [29] computed the dimension of Span($V_T$) 
for the JC model. Hendy and Penny [20] performed a similar 
computation for the K2P model. 

Theorem 8: [29], [20] Let $T$ be an $n$-taxon binary tree. 
Then, for the JC model on $T$, the number of distinct Fourier 
coordinates is the Fibonacci number $F_{2n-2}$, satisfying the 
recurrence 

$F_0 = 1, \quad F_1 = 1, \quad F_i = F_{i-1} + F_{i-2}.$ 

That is, for the JC model Span($V_T$) has dimension $F_{2n-2}$. 

For the K2P model on $T$, the number of distinct Fourier 
coordinates is $H_n$, satisfying the recurrence: 

$H_1 = 1, \quad H_2 = 3, \quad H_i = 4H_{i-1} - 2H_{i-2}.$ 

That is, for the K2P model Span($V_T$) has dimension $H_n$. 
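
These counts can be cross-checked for $n = 4$ by brute force. The sketch below is illustrative only: it assumes the Fibonacci indexing $F_0 = F_1 = 1$, under which the JC count for $n = 4$ is $F_6 = 13$ (the number of columns of the quartet configuration used in Section IV), and it enumerates Fourier monomials on the quartet tree 12|34 directly. The function names are hypothetical.

```python
from itertools import product

def fib(i):
    """Fibonacci numbers indexed so that F_0 = F_1 = 1 (an assumed convention)."""
    a, b = 1, 1
    for _ in range(i):
        a, b = b, a + b
    return a

def k2p_count(n):
    """H_1 = 1, H_2 = 3, H_i = 4*H_{i-1} - 2*H_{i-2}."""
    a, b = 1, 3
    for _ in range(n - 1):
        a, b = b, 4 * b - 2 * a
    return a

Z2Z2 = [(0, 0), (0, 1), (1, 0), (1, 1)]       # A, C, G, T

def add(g, h):
    return ((g[0] + h[0]) % 2, (g[1] + h[1]) % 2)

def quartet_coordinate_count(classify):
    """Count distinct Fourier monomials on the quartet tree 12|34."""
    seen = set()
    for g1, g2, g3 in product(Z2Z2, repeat=3):
        g4 = add(add(g1, g2), g3)             # enforce g1 + g2 + g3 + g4 = 0
        edge_labels = (g1, g2, g3, g4, add(g1, g2))
        seen.add(tuple(classify(g) for g in edge_labels))
    return len(seen)

def jc_class(g):
    return 0 if g == (0, 0) else 1            # parameter classes A | {C, G, T}

def k2p_class(g):
    return {(0, 0): 0, (0, 1): 1}.get(g, 2)   # parameter classes A | C | {G, T}

print(fib(2 * 4 - 2), quartet_coordinate_count(jc_class))
print(k2p_count(4), quartet_coordinate_count(k2p_class))
print(quartet_coordinate_count(lambda g: g))  # K3P: no parameters identified
```

For the quartet tree both routes give 13 for JC and 34 for K2P, and the K3P count is $4^{n-1} = 64$, in agreement with Proposition 7.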

Fourier coordinates for group-based models have useful 
combinatorial representatives in terms of labelled or colored 
versions of the underlying tree T. For this representation, we 
associate a color to each of the different parameter classes 
in the model. For example, in the JC model, there are two 
parameter classes: the A class (grey), and the {C, G,T} 
class (black). With this choice of colors, in the parametric 
description of $q_{g_1 \dots g_n}$, if a parameter $a^{B|B'}_A$ occurs, then we 
color the edge corresponding to the split $B|B'$ grey. If a 
parameter $a^{B|B'}_C$, $a^{B|B'}_G$, or $a^{B|B'}_T$ occurs, we color the edge 
corresponding to the split $B|B'$ black. As shown in [29], 
this establishes a correspondence between distinct Fourier 
coefficients for the JC model and subforests of $T$ with the 
same leaf set $[n]$. 

The color-coding of the underlying tree works similarly for 
the K2P model; here we have three classes of parameters, the 
A-class, the C-class, and the {G, T}-class, and hence use three 
colors. 

Example 9: Continuing Example 5, the colored diagram 
corresponding to the Fourier coordinate $q_{CCCTG}$ for the JC 
model is shown in Figure 1. 

Fig. 1. JC-coloring for $q_{CCCTG}$. (Key: A-class = dashed grey, {C, G, T}-class 
= black) 

For the K2P model, the same Fourier coordinate is represented 
by the tri-colored tree of Figure 2. 

These diagrams are useful for determining the invariants 
that a particular group-based model satisfies. For instance, one 
phylogenetic invariant for the JC model on the tree $T$ with split 
$12|34$ is given in Fourier coordinates by 

$q_{CTGA}\, q_{ACTG} = q_{CGCG}\, q_{ACCA}. \qquad (3)$ 
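
Invariant (3) can be checked by comparing the JC monomials on each side. In the sketch below each Fourier coordinate is reduced to the multiset of edges carrying a non-identity group element (the black edges of the colored diagrams); the function names and the `partner` argument, which selects the leaf forming a cherry with leaf 1, are illustrative assumptions.

```python
from collections import Counter

GROUP = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def add(g, h):
    return ((g[0] + h[0]) % 2, (g[1] + h[1]) % 2)

def jc_monomial(word, partner=2):
    """JC monomial of q_word on the quartet tree with cherry {1, partner}.

    Returns a Counter of edges carrying a non-identity group element; edges
    labelled by the identity contribute a^e_A = 1 and are omitted.
    """
    g = [GROUP[ch] for ch in word]
    if add(add(g[0], g[1]), add(g[2], g[3])) != (0, 0):
        raise ValueError("q_word is identically zero")
    labels = {1: g[0], 2: g[1], 3: g[2], 4: g[3],
              "internal": add(g[0], g[partner - 1])}
    return Counter(e for e, x in labels.items() if x != (0, 0))

def invariant_3_holds(partner):
    """Compare the exponent vectors of the two sides of invariant (3)."""
    lhs = jc_monomial("CTGA", partner) + jc_monomial("ACTG", partner)
    rhs = jc_monomial("CGCG", partner) + jc_monomial("ACCA", partner)
    return lhs == rhs

print(invariant_3_holds(2))   # tree 12|34
print(invariant_3_holds(3))   # tree 13|24
```

The two sides agree on the tree 12|34 but differ in the exponent of the internal edge on 13|24, so the relation is specific to the tree with split 12|34.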



Fig. 2. K2P-coloring for $q_{CCCTG}$. (Key: A-class = dashed grey, C-class 
= black, {G, T}-class = dashed black) 

This relationship may be represented in pictorial form by the 
diagram in Figure 3. 

Fig. 3. Pictorial view of invariant (3) for the JC model on $T$. 



IV. Identifiability of Tree Parameters 

The goal of this section is to prove that the tree parameters 
are generically identifiable for 2-tree JC and K2P mixture 
models on at least 4 taxa. For the complex varieties associated 
to the models this is formalized as follows: 

Theorem 10: Suppose $T_1, T_2, T_3, T_4$ are binary trees, not 
necessarily distinct, on $n \ge 4$ taxa, and consider the 2-tree 
mixture varieties for the JC and K2P models. If $\{T_1, T_2\} \neq \{T_3, T_4\}$, 
then $V_{T_1} * V_{T_2} \not\subseteq V_{T_3} * V_{T_4}$. 

If $\{T_1, T_2\} \neq \{T_3, T_4\}$, then the noncontainment of the 
irreducible varieties $V_{T_1} * V_{T_2}$ and $V_{T_3} * V_{T_4}$ in one another 
shows their intersection is a proper subvariety of strictly lower 
dimension. The preimage of this intersection under the complex 
parameterization map is thus a proper subvariety of the 
parameter space. Since the subset of stochastic parameters is 
Zariski dense in the complex parameter space, those stochastic 
parameters mapping to the intersection also lie in a closed set 
of Lebesgue measure 0. Thus we obtain the main result of the 
section: 

Corollary 11: For the 2-tree JC and K2P mixture models on 
at least 4 taxa the tree parameters are generically identifiable 
for either stochastic or complex parameters. 

The proof of Theorem 10 proceeds in three parts. First, 
we show that the two tree parameters when $T_1 = T_2$ are 
identifiable. Then we focus on the quartet trees, constructing 
a linear invariant that completes the proof of the result for 
quartet tree mixtures. Finally, we combine our quartet results, 
the six-to-infinity theorem of Matsen, Mossel, and Steel [23], 
and a linear invariant for 6-taxon tree mixtures, to deduce 
identifiability of trees for 2-tree mixtures on an arbitrary 
number of leaves. 



A. 2-Tree Mixture with $T_1 = T_2$ 

In this section, we focus on a mixture of a tree with itself; 
that is, we study the secant variety $V_T * V_T$. We show that 
$V_T * V_T$ can be distinguished from any 2-tree mixture variety 
$V_{T_1} * V_{T_2}$, provided $T_1$ and $T_2$ are not both $T$. 

Proposition 12: Let $T_1, T_3, T_4$ be three binary trees, not 
necessarily distinct, with $n \ge 4$ leaves, such that $\{T_1\} \neq \{T_3, T_4\}$. 
Then under both the JC and K2P 2-tree mixture 
models, $V_{T_1} * V_{T_1} \not\subseteq V_{T_3} * V_{T_4}$ and $V_{T_3} * V_{T_4} \not\subseteq V_{T_1} * V_{T_1}$. 

Proof: Assume $T_3 \neq T_1$. By [29] and [20], for both the 
unmixed JC and K2P models, there exists a linear invariant 
$l \in I(V_{T_1}) \setminus I(V_{T_3})$. Since the sets of linear invariants of $V_{T_1}$ 
and $V_{T_1} * V_{T_1}$ coincide (the varieties have the same span), there 
exists a linear invariant $l \in I(V_{T_1} * V_{T_1}) \setminus I(V_{T_3})$. Hence, 
$V_{T_3} \not\subseteq V_{T_1} * V_{T_1}$. Now since $V_{T_3} \subseteq V_{T_3} * V_{T_4}$, it follows that 
$V_{T_3} * V_{T_4} \not\subseteq V_{T_1} * V_{T_1}$. 

It remains to show $V_{T_1} * V_{T_1} \not\subseteq V_{T_3} * V_{T_4}$. In fact, it is 
enough to show that 

$\dim V_{T_3} * V_{T_4} \le \dim V_{T_1} * V_{T_1}. \qquad (4)$ 

Indeed, if this inequality holds strictly, the claim is obvious. If, 
on the other hand, the dimensions are equal, then since both of 
the joins are irreducible varieties, $V_{T_1} * V_{T_1} \subseteq V_{T_3} * V_{T_4}$ would 
imply equality of varieties, contradicting the non-containment 
already established. 

Now a simple bound on the dimension of a join, coming 
from its natural parameterization, is 

$\dim V * W \le \dim V + \dim W + 1,$ 

where the quantity on the right is called the expected dimension 
when it is no larger than the dimension of the ambient 
space. In the case of the JC model, $\dim(V_{T_3}) = \dim(V_{T_4}) = 2n - 3$, 
which shows $\dim(V_{T_3} * V_{T_4}) \le 4n - 5$. Similarly, 
$\dim(V_{T_3} * V_{T_4}) \le 8n - 11$ for the K2P model. 
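
The arithmetic behind these two bounds is a one-line computation. The sketch below assumes that the model dimensions are one free Fourier parameter per edge for JC and two per edge for K2P on a binary tree with $2n - 3$ edges, which agrees with the dimensions quoted in the text; the function names are illustrative.

```python
def expected_join_dimension(dim_v, dim_w):
    """Upper bound dim V + dim W + 1 on dim(V * W), ignoring the ambient cap."""
    return dim_v + dim_w + 1

def jc_dim(n):
    return 2 * n - 3          # one free Fourier parameter per edge

def k2p_dim(n):
    return 2 * (2 * n - 3)    # two free Fourier parameters per edge

for n in (4, 5, 6):
    assert expected_join_dimension(jc_dim(n), jc_dim(n)) == 4 * n - 5
    assert expected_join_dimension(k2p_dim(n), k2p_dim(n)) == 8 * n - 11

print(expected_join_dimension(jc_dim(4), jc_dim(4)),
      expected_join_dimension(k2p_dim(4), k2p_dim(4)))
```

For $n = 4$ the bounds are 11 for JC and 21 for K2P.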

To complete the proof of the claim of Proposition 12 for 
the JC model by establishing inequality (4), it suffices to show 
the secant variety has the expected dimension, as given by the 
following: 

Lemma 13: If $T$ is an $n$-taxon binary tree then 
$\dim V_T * V_T = 4n - 5$ for the JC model. 

As the proof of this lemma is more involved, we defer it 
until after our current argument. For the K2P model, we prove 
below a weaker claim. 

Lemma 14: If $T$ is a 4-taxon binary tree, then 
$\dim V_T * V_T = 21$ for the K2P model. 

This is sufficient to complete the proof of the claim of 
Proposition 12 in the K2P case for 4-taxon trees. Larger trees 
are then treated by considering marginalizations to induced 4-taxon 
trees: Choose a set $K$ of 4 taxa for which the induced 
quartet trees $T_1|_K$, $T_3|_K$, $T_4|_K$ are not all the same, and then 
apply Lemma 3. ■ 

To prove Lemma 13 we make use of a special case of the 
tropical secant varieties theory of Draisma [11] and the fact 
that the varieties $V_T$ are toric varieties. To explain Draisma's 
result (Theorem 15 below), we introduce some background 
material on toric varieties and convex geometry. 

Recall that a toric variety is specified as the image of a 
polynomial map, each of whose coordinate functions is a 
monomial. As a monomial is of the form $x^u = x_1^{u_1} x_2^{u_2} \cdots x_d^{u_d}$, 
we associate to each monomial a non-negative integer vector 
$u$. To a toric variety, we associate a collection of non-negative 
integer vectors $A \subseteq \mathbb{N}^d$, one vector for each monomial 
appearing in the parameterization. We also identify $A$ with 
a matrix whose columns are the given set. The toric variety 
is often denoted $V_A$. Algebraic and geometrical properties of 
toric varieties are reflected in corresponding properties of the 
vector configuration $A$ [14], [30]. 

By a hyperplane in $\mathbb{R}^d$, we mean a linear hypersurface 
$H = \{x \in \mathbb{R}^d : c^T x = e\}$. The complement $\mathbb{R}^d \setminus H$ consists of two 
connected components, which we denote by $H^+$ and $H^-$. 

Theorem 15 ([11]): Let $V_A$ be a projective toric variety, 
with corresponding set of exponent vectors $A \subseteq \mathbb{N}^d$. Suppose 
that $A$ has rank $r$, so that $\dim V_A = r - 1$. Let $H$ be 
a hyperplane not intersecting $A$. Let $A^+ = A \cap H^+$ and 
$A^- = A \cap H^-$. Then $\dim V_A * V_A \ge \operatorname{rank} A^+ + \operatorname{rank} A^- - 1$. 
In particular, if there exists an $H$ such that 
$\operatorname{rank} A^+ = \operatorname{rank} A^- = \operatorname{rank} A$, then $V_A * V_A$ has the expected dimension. 

Proof (of Lemma 13): To apply Theorem 15 we 
must investigate the vector configurations $A$ associated with 
the Jukes-Cantor model and find a hyperplane with the desired 
properties. For the Jukes-Cantor model on a binary tree, each 
distinct Fourier coordinate corresponds to a subforest $F$ of the 
tree, and the corresponding monomial has the form 

$\prod_{e \in F} a^e_C \prod_{e \notin F} a^e_A.$ 

(Here we consider the $a^e_A$ as variables, rather than setting them 
to be 1. This simply homogenizes our parameterization.) The 
vector corresponding to $F$ is in $\mathbb{N}^{4n-6}$; specifically, 
$u_F = (x_e, y_e)_{e \in \Sigma(T)}$ such that $x_e = 1$ and $y_e = 0$ if $e \in F$, and 
$x_e = 0$ and $y_e = 1$ if $e \notin F$. For example, in the case that 
$n = 4$, and $T$ is the tree with nontrivial split $12|34$, then, after 
removing repeated columns, $A$ consists of the columns of the 
$10 \times 13$ matrix: 

The first 2 rows here correspond to the $x_e$ for edges in one 
cherry on the tree, the next 2 to $x_e$ for edges in the other cherry, 
and the 5th to $x_e$ for the central edge; the last 5 correspond 
to the $y_e$, with edges in the same order. Thus, the first column 
corresponds to the empty forest, the second to the first cherry, 
and so on. 

Consider the hyperplane 
$H = \{(x_e, y_e) \in \mathbb{R}^{4n-6} : \sum_{e \in \Sigma(T)} x_e = |\Sigma(T)| - 3/2\}$. 
This means that the vectors 
on one side of the partition will correspond to subforests of 
$T$ having at most one edge of $T$ missing. The subforests on 
the other side will have at least two edges missing. Call the 
first set of vectors $A^+$ and the second set of vectors $A^-$. In 
the matrix above, $A^+$ consists of the last six columns and $A^-$ 
consists of the first seven columns. 

The first set $A^+$ contains exactly $|\Sigma(T)| + 1$ vectors, since 
the tree itself is a subforest and removing any edge always 
produces a subforest. This set thus forms the vertices of a 
simplex of dimension equal to the number of edges, so $A^+$ 
has rank $2n - 2$. 

The second set, $A^-$, contains the empty graph and all 
paths between pairs of vertices. If we restrict attention to 
only those vectors corresponding to the paths between pairs of 
vertices, this gives us the exponent vectors of the toric varieties 
corresponding to toric degenerations of the Grassmannian [28], 
which has rank $2n - 3$. Adding the vector corresponding to 
the empty subforest increases the rank by one. ■ 

Proof (of Lemma 14): To apply Theorem 15 
we must investigate the vector configurations $A$ associated 
with the K2P model and find a hyperplane with the desired 
properties. For $n = 4$, there are $H_4 = 34$ distinct Fourier 
coordinates. We focus on the tree with the unique nontrivial 
split $12|34$. Each monomial in the Fourier parameterization 
has the form 

$a^1_{g_1} a^2_{g_2} a^3_{g_3} a^4_{g_4}\, a^{12|34}_{g_1+g_2},$ 

where $a^e_G = a^e_T$ for all splits $e$. This implies that the matrix 
$A$ is a $15 \times 34$ matrix. The coordinates on $\mathbb{R}^{15}$ are $x_e, y_e, z_e$, 
where for a given edge $e$, 

$(x_e, y_e, z_e) = \begin{cases} (1,0,0) & \text{if } g_e = A, \\ (0,1,0) & \text{if } g_e = C, \\ (0,0,1) & \text{if } g_e = G, T. \end{cases}$ 

Since the model has dimension 10, we see that the matrix $A$ 
has rank 11. 

Now consider the hyperplane 
$H = \{(x_e, y_e, z_e) \in \mathbb{R}^{15} : \sum_{e \in \Sigma(T)} (y_e + z_e) = 7/2\}$. 
A direct calculation shows that this 
partitions $A$ into $A^+$ and $A^-$ each with rank 11, and completes 
the proof. ■ 


B. Linear Invariants for Quartet Mixtures 

We next focus on quartet trees. The three fully-resolved
quartet trees will be indicated by their non-trivial splits: $T_{12|34}$,
$T_{13|24}$, and $T_{14|23}$. The main result of this section is that linear
invariants can generically identify 2-quartet mixtures.

Lemma 16: For both the JC and K2P models, the linear
polynomial

$$f = q_{GGGG} + q_{GTGT} - q_{GGTT} - q_{GTTG}$$

satisfies $f \in \mathcal{I}(V_{T_{12|34}} * V_{T_{14|23}}) \setminus \mathcal{I}(V_{T_{13|24}})$.

Note the lemma further implies

$$f \in \mathcal{I}(V_{T_{12|34}} * V_{T_{14|23}}) \setminus \mathcal{I}(V_{T_{13|24}} * V_{T_{14|23}}).$$

Combining this with Proposition Q~2] we deduce a first case of 
Theorem [10].

Corollary 17: The case $n = 4$ of Theorem [10] holds.
Proof: [Proof of Lemma [16]] Denote the parameters for
tree $T_{12|34}$ by $a$ and the parameters for $T_{14|23}$ by $b$. We must
show that $f = 0$ whenever we substitute for the $q$'s the
parameterization for the mixture model. One checks that:

$$\begin{aligned}
q_{GGGG} &= \pi\, a^1_G a^2_G a^3_G a^4_G\, a^{12|34}_A + (1-\pi)\, b^1_G b^2_G b^3_G b^4_G\, b^{14|23}_A, \\
q_{GTGT} &= \pi\, a^1_G a^2_T a^3_G a^4_T\, a^{12|34}_C + (1-\pi)\, b^1_G b^2_T b^3_G b^4_T\, b^{14|23}_C, \\
q_{GGTT} &= \pi\, a^1_G a^2_G a^3_T a^4_T\, a^{12|34}_A + (1-\pi)\, b^1_G b^2_G b^3_T b^4_T\, b^{14|23}_C, \\
q_{GTTG} &= \pi\, a^1_G a^2_T a^3_T a^4_G\, a^{12|34}_C + (1-\pi)\, b^1_G b^2_T b^3_T b^4_G\, b^{14|23}_A.
\end{aligned}$$

Since for the K2P and JC models $a^e_G = a^e_T$ and $b^e_G = b^e_T$ for all
$e$, these formulae show $f = 0$, as can be checked using the
color-coded trees introduced earlier. Thus $f \in \mathcal{I}(V_{T_{12|34}} * V_{T_{14|23}})$.
On the other hand, for the tree $T_{13|24}$ we have:

$$\begin{aligned}
q_{GGGG} &= c^1_G c^2_G c^3_G c^4_G\, c^{13|24}_A, \\
q_{GTGT} &= c^1_G c^2_T c^3_G c^4_T\, c^{13|24}_A, \\
q_{GGTT} &= c^1_G c^2_G c^3_T c^4_T\, c^{13|24}_C, \\
q_{GTTG} &= c^1_G c^2_T c^3_T c^4_G\, c^{13|24}_C.
\end{aligned}$$

Even though in the JC model $c^e_C = c^e_G = c^e_T$ and $c^e_A = 1$
for all $e$, $f$ is not identically zero when evaluated at these
expressions. Thus $f \notin \mathcal{I}(V_{T_{13|24}})$ for the JC model, and hence
also for the K2P model. ■
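
The cancellation behind Lemma 16 can be written out in two lines. The following is a sketch only: it uses the rule that the internal edge of a quartet tree carries the group sum of the labels on one side of its split (so $G+G=A$ and $G+T=C$), together with $a^e_G = a^e_T$, $b^e_G = b^e_T$, $c^e_G = c^e_T$. The abbreviations $P_a$, $P_b$, $P_c$ for the products of the four pendant-edge parameters are introduced here for convenience and are not notation from the paper.

```latex
% P_a = a^1_G a^2_G a^3_G a^4_G, and similarly P_b and P_c (shorthand assumed here).
\begin{align*}
f\big|_{V_{T_{12|34}} * V_{T_{14|23}}}
  &= \pi P_a \bigl( a^{12|34}_A + a^{12|34}_C - a^{12|34}_A - a^{12|34}_C \bigr) \\
  &\quad + (1-\pi) P_b \bigl( b^{14|23}_A + b^{14|23}_C - b^{14|23}_C - b^{14|23}_A \bigr) = 0, \\[4pt]
f\big|_{V_{T_{13|24}}}
  &= P_c \bigl( c^{13|24}_A + c^{13|24}_A - c^{13|24}_C - c^{13|24}_C \bigr)
   = 2 P_c \bigl( c^{13|24}_A - c^{13|24}_C \bigr).
\end{align*}
```

In the JC case $c^{13|24}_A = 1$, so the second expression vanishes only when $c^{13|24}_C = 1$ or some pendant parameter is zero, which does not happen for generic parameters.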



C. From Quartets to Sextets and Beyond 

Identifiability of quartet mixtures can be used to show 
identifiability for larger trees by marginalization of tree models 
and their mixtures. However, it is not, in general, possible 
to identify two trees from the union of their sets of induced 
quartet trees. Thus this approach requires some care. That all 
difficulties arise from trees of at most 6 taxa is the content of 
the following combinatorial theorem of Matsen, Mossel, and 
Steel. 

Theorem 18 (Six-to-infinity Theorem): [23] Suppose that
the tree parameters $T_1, T_2$ are identifiable for a 2-tree phylogenetic
mixture model for binary trees with six leaves. Then tree
parameters are identifiable for binary trees with > 6 leaves.

Combining the results of Corollary [17] and Lemma [3] we
have that $V_{T_1} * V_{T_2} \not\subseteq V_{T_3} * V_{T_4}$ if there is a four-element
subset $Q \subseteq [n]$ such that $\{T_1|_Q, T_2|_Q\} \neq \{T_3|_Q, T_4|_Q\}$.

It remains to show that $V_{T_1} * V_{T_2} \not\subseteq V_{T_3} * V_{T_4}$ for pairs
of trees such that $\{T_1|_Q, T_2|_Q\} = \{T_3|_Q, T_4|_Q\}$ for all four-element
subsets $Q \subseteq [n]$. Let $\mathcal{Q}(T_i, T_j)$ denote the multiset
of all quartet trees $T_i|_Q, T_j|_Q$ induced by $T_i$ and $T_j$. We say
two pairs of trees $T_1, T_2$ and $T_3, T_4$ are quartet-matched if
$\mathcal{Q}(T_1, T_2) = \mathcal{Q}(T_3, T_4)$.

Proposition 19: For $n = 5$ leaves, any two quartet-matched
pairs of trees $T_1, T_2$ and $T_3, T_4$ have $\{T_1, T_2\} = \{T_3, T_4\}$.
For $n = 6$ leaves, every quartet-matched pair of trees with
$\{T_1, T_2\} \neq \{T_3, T_4\}$ is equivalent, up to $\mathfrak{S}_6$ symmetry, to the
pairs defined by

$$\begin{aligned}
T_1 &= \{12|3456,\ 123|456,\ 1234|56\}, \\
T_2 &= \{13|2456,\ 123|456,\ 1235|46\}, \\
T_3 &= \{13|2456,\ 123|456,\ 1234|56\}, \\
T_4 &= \{12|3456,\ 123|456,\ 1235|46\}.
\end{aligned} \tag{5}$$



Proof: Fix two binary trees $T_1, T_2$ with $n$ leaves. If the
trees are identical, the result is clear, so we assume throughout
$T_1 \neq T_2$.

Consider first $n = 5$.

If leaves $j, k$ form a cherry in $T_i$, then they will also form
a cherry in all 3 quartet trees including $j, k$ induced from $T_i$.
On the other hand, if they do not form a cherry in $T_i$, then
they will form a cherry in either 0 or 1 of these 3 induced
quartet trees. Thus by counting the elements of the multiset
$\mathcal{Q}(T_1, T_2)$ with each possible cherry $j, k$, we can determine
which cherries occur in both trees (count 6), which occur in
exactly one tree (count 3 or 4), and which occur in no trees
(count 0, 1, or 2).

If a cherry occurs in both trees, suppose it is $\{1, 2\}$. Then
from considering the quartets on $\{2,3,4,5\}$ both $T_1$ and $T_2$
are determined.

If the two trees have no cherry in common, then we know
the 4 distinct cherries that occur in the 2 trees. If only 4 taxa
occur in these 4 cherries, then we may uniquely pair them
according to their compatibility, and the two 5-taxon trees $T_1$
and $T_2$ are determined. If all 5 taxa occur in these cherries,
since the cherries are distinct we may assume they are $\{1,2\}$
and $\{3,4\}$ (from one tree), and $\{1,5\}$ and $\{2,3\}$ (from the
other), though we initially do not know which come from
which tree. However, we again see that these can be uniquely
paired for compatibility, and thus $T_1$ and $T_2$ are determined.

Now consider $n = 6$.

By the $n = 5$ case, we may determine the multiset $\mathcal{T} =
\mathcal{T}(T_1, T_2)$ of all 5-taxon induced trees from $T_1$ and $T_2$, so we
work with it instead of $\mathcal{Q} = \mathcal{Q}(T_1, T_2)$.

By counting cherries in $\mathcal{T}$, we may determine those possible
cherries that occur in both trees (count 8), exactly one tree
(count 4 or 5), or no trees (count 0, 1, or 2). If a cherry occurs
in both trees, suppose it is $\{1, 2\}$. Then from considering the
5-taxon trees on $\{2, 3, 4, 5, 6\}$ both $T_1$ and $T_2$ are determined.

For the remainder of the proof, we assume the trees have no
cherry in common. Thus either 4, 5, or 6 distinct cherries occur
in $T_1$ and $T_2$. In the case of 6 distinct cherries, compatibility
of cherries determines $T_1$ and $T_2$.

In the case of 5 distinct cherries, one of the trees must
be symmetric, and the other a caterpillar. Either compatibility
of cherries determines the symmetric tree (in which case
both trees are determined by removing the quartets from this
tree from $\mathcal{Q}$ and using the remaining ones to construct the
second tree), or we may assume the 5 cherries have the form
$\{1,2\}$, $\{3,4\}$, $\{5,6\}$ (from one tree) and $\{1,3\}$, $\{2,4\}$ (from
the other), though of course we do not know which come
from which tree. Since the cherry $\{5,6\}$ is identified by this,
consider the two elements of $\mathcal{T}$ on $\{1,2,4,5,6\}$. As $\{5,6\}$
is a cherry in only one of these trees, and that one also has
$\{1,2\}$ as its other cherry, this identifies $\{1,2\}$. Thus $\{3,4\}$
is also known, and thus one, and hence both, of the trees are
determined.

In the case of 4 distinct cherries, both $T_1$ and $T_2$ are
caterpillars. We first investigate whether we can determine
which pairs of cherries occur on the same tree $T_i$. Since at
least 2 of the 4 cherries must be incompatible, let these be
$\{1,2\}$ and $\{1,3\}$. Either compatibility determines which other
cherries these are paired with, or the remaining cherries have
the form $\{i,j\}$ with $i,j \in \{4,5,6\}$, and we may assume the
cherries are $\{4,6\}$ and $\{5,6\}$.

If compatibility determined the cherries on $T_1$ as $\{1,2\}$ and
$\{i,j\}$, and those on $T_2$ as $\{1,3\}$ and $\{k,l\}$, then we may
assume $j \neq 3$. Then the two elements of $\mathcal{T}$ on all taxa but $j$
can be matched with the $T_i$ depending on whether they display
the cherry $\{1, 2\}$ or $\{1, 3\}$. This determines $T_1$, and hence $T_2$
as well.

This leaves only the case where the 4 cherries are $\{1, 2\}$,
$\{1,3\}$, $\{4,6\}$ and $\{5,6\}$, which may be paired two ways.
Considering the two elements of $\mathcal{T}$ on $\{1,2,3,4,5\}$, exactly
one must contain the cherry $\{1,2\}$. If the other cherry
in this 5-taxon tree is $\{3, 5\}$, then this determines $T_1$ as
$\{12|3456, 124|356, 1234|56\}$, and hence $T_2$ is determined as
well. Similarly, if the second cherry in the 5-taxon tree
is $\{3, 4\}$, then $T_1$ and $T_2$ are again uniquely determined.
If the second cherry is $\{4, 5\}$, however, $T_1$ may be either
$\{12|3456, 123|456, 1234|56\}$ or $\{12|3456, 123|456, 1235|46\}$.
Considering the element of $\mathcal{T}$ on $\{1,2,3,4,5\}$ that contains
cherry $\{1,3\}$, we likewise obtain two unique trees
except in the case where the second cherry is $\{4, 5\}$, in
which case $T_2$ could be either $\{13|2456, 123|456, 1234|56\}$
or $\{13|2456, 123|456, 1235|46\}$. Finally, since only one of $T_1$
and $T_2$ can have cherry $\{5, 6\}$, the only remaining ambiguous
case is that described in the statement of the Proposition. ■
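
The exceptional 6-leaf case in Proposition 19 can be checked mechanically. The sketch below is illustrative and not the authors' code: it assumes trees are given by their non-trivial splits written as strings with single-digit leaf labels, and the helper names (`parse_tree`, `induced_quartet`, `quartet_multiset`) are hypothetical. It computes the multiset of induced quartet trees for each pair in (5) and compares them.

```python
from collections import Counter
from itertools import combinations


def parse_tree(split_strings):
    """Parse splits such as '12|3456' (single-digit leaf labels) into bipartitions."""
    tree = set()
    for s in split_strings:
        left, right = s.split("|")
        tree.add(frozenset([frozenset(map(int, left)), frozenset(map(int, right))]))
    return frozenset(tree)


def induced_quartet(tree, quartet):
    """Return the quartet tree induced on four leaves, as a pair of two-leaf sets."""
    q = frozenset(quartet)
    for split in tree:
        side = next(iter(split))
        part = side & q
        if len(part) == 2:  # this split separates the quartet two against two
            return frozenset([part, q - part])
    raise ValueError("no split resolves this quartet; is the tree binary?")


def quartet_multiset(tree_a, tree_b, n=6):
    """Multiset of induced quartet trees of a pair, over all 4-subsets of the leaves."""
    counts = Counter()
    for quartet in combinations(range(1, n + 1), 4):
        counts[induced_quartet(tree_a, quartet)] += 1
        counts[induced_quartet(tree_b, quartet)] += 1
    return counts


# The four trees of equation (5).
T1 = parse_tree(["12|3456", "123|456", "1234|56"])
T2 = parse_tree(["13|2456", "123|456", "1235|46"])
T3 = parse_tree(["13|2456", "123|456", "1234|56"])
T4 = parse_tree(["12|3456", "123|456", "1235|46"])

quartet_matched = quartet_multiset(T1, T2) == quartet_multiset(T3, T4)
same_pair = {T1, T2} == {T3, T4}
print(quartet_matched, same_pair)
```

The pairs come out quartet-matched although they are different pairs of trees, which is why quartets alone cannot separate them and a further invariant is needed.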

Lemma 20: Consider the trees $T_1$, $T_2$, $T_3$, and $T_4$ in equations
(5) from Proposition [19]. Define the linear polynomial

$$f = q_{GGGGGG} + q_{GTTTTG} - q_{GTGGTG} - q_{GGTTGG}.$$

Then, for the JC and K2P models, $f$ satisfies

$$f \in \mathcal{I}(V_{T_1} * V_{T_2}) \setminus \mathcal{I}(V_{T_i}), \quad i \in \{3,4\}.$$

In particular, $V_{T_i} \not\subseteq V_{T_1} * V_{T_2}$, $i \in \{3,4\}$.

Proof: By symmetry of the relationship between trees $T_3$
and $T_4$ to that of $T_1$ and $T_2$, it suffices to prove the statement in
the case that $i = 3$. First, we will show that $f \in \mathcal{I}(V_{T_1} * V_{T_2})$.
Denote by $a$'s and $b$'s the parameters of the trees $T_1$ and $T_2$,
respectively. One checks that

$$\begin{aligned}
q_{GGGGGG} &= \pi\, a^1_G a^2_G a^3_G a^4_G a^5_G a^6_G\, a^{12|3456}_A a^{123|456}_G a^{1234|56}_A \\
&\quad + (1-\pi)\, b^1_G b^2_G b^3_G b^4_G b^5_G b^6_G\, b^{13|2456}_A b^{123|456}_G b^{1235|46}_A, \\
q_{GTTTTG} &= \pi\, a^1_G a^2_T a^3_T a^4_T a^5_T a^6_G\, a^{12|3456}_C a^{123|456}_G a^{1234|56}_C \\
&\quad + (1-\pi)\, b^1_G b^2_T b^3_T b^4_T b^5_T b^6_G\, b^{13|2456}_C b^{123|456}_G b^{1235|46}_C, \\
q_{GTGGTG} &= \pi\, a^1_G a^2_T a^3_G a^4_G a^5_T a^6_G\, a^{12|3456}_C a^{123|456}_T a^{1234|56}_C \\
&\quad + (1-\pi)\, b^1_G b^2_T b^3_G b^4_G b^5_T b^6_G\, b^{13|2456}_A b^{123|456}_T b^{1235|46}_A, \\
q_{GGTTGG} &= \pi\, a^1_G a^2_G a^3_T a^4_T a^5_G a^6_G\, a^{12|3456}_A a^{123|456}_T a^{1234|56}_A \\
&\quad + (1-\pi)\, b^1_G b^2_G b^3_T b^4_T b^5_G b^6_G\, b^{13|2456}_C b^{123|456}_T b^{1235|46}_C.
\end{aligned}$$

Recall that for the JC and K2P models, we have $a^e_G = a^e_T$ and
$b^e_G = b^e_T$ for all $e$. Therefore, $f \in \mathcal{I}(V_{T_1} * V_{T_2})$.



On the other hand, if we denote the parameters for the tree
$T_3$ by $c$'s, we have that:

$$\begin{aligned}
q_{GGGGGG} &= c^1_G c^2_G c^3_G c^4_G c^5_G c^6_G\, c^{13|2456}_A c^{123|456}_G c^{1234|56}_A, \\
q_{GTTTTG} &= c^1_G c^2_T c^3_T c^4_T c^5_T c^6_G\, c^{13|2456}_C c^{123|456}_G c^{1234|56}_C, \\
q_{GTGGTG} &= c^1_G c^2_T c^3_G c^4_G c^5_T c^6_G\, c^{13|2456}_A c^{123|456}_T c^{1234|56}_C, \\
q_{GGTTGG} &= c^1_G c^2_G c^3_T c^4_T c^5_G c^6_G\, c^{13|2456}_C c^{123|456}_T c^{1234|56}_A.
\end{aligned}$$

Although for the JC model $c^e_C = c^e_G = c^e_T$ and $c^e_A = 1$ for all
$e$, the linear polynomial $f$ evaluated at the expressions above
does not give the zero polynomial. Therefore $f \notin \mathcal{I}(V_{T_3})$ for
the JC, and hence for the K2P, model. ■
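
Lemma 20 can be sanity-checked numerically with exact rational arithmetic. The following sketch is not the authors' code. It assumes an encoding of the group $\mathbb{Z}_2 \times \mathbb{Z}_2$ as the integers 0 to 3 under bitwise XOR (with $A = 0$), represents each edge by the leaves on one side of its split, and sets the Fourier parameter at $A$ to 1. All function names are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from operator import xor
import random

A, C, G, T = 0, 1, 2, 3  # encoding of Z2 x Z2 chosen here; the group law is bitwise XOR


def random_k2p_edge(rng):
    """Rational Fourier parameters for one edge, with the K2P constraint p[G] == p[T]."""
    p_c = Fraction(rng.randint(1, 99), 100)
    p_g = Fraction(rng.randint(1, 99), 100)
    return {A: Fraction(1), C: p_c, G: p_g, T: p_g}


def random_params(tree, rng, n=6):
    """One parameter set per pendant edge (leaf,) and per internal edge (one side of its split)."""
    edges = [(leaf,) for leaf in range(1, n + 1)] + list(tree)
    return {edge: random_k2p_edge(rng) for edge in edges}


def fourier_coordinate(params, labels):
    """Monomial q_{g_1...g_n}: each edge contributes its parameter at the sum of labels on its side."""
    value = Fraction(1)
    for side, p in params.items():
        value *= p[reduce(xor, (labels[leaf - 1] for leaf in side))]
    return value


def invariant(q):
    """f = q_GGGGGG + q_GTTTTG - q_GTGGTG - q_GGTTGG for a function q of label tuples."""
    return (q((G, G, G, G, G, G)) + q((G, T, T, T, T, G))
            - q((G, T, G, G, T, G)) - q((G, G, T, T, G, G)))


T1 = [(1, 2), (1, 2, 3), (5, 6)]  # splits 12|3456, 123|456, 1234|56
T2 = [(1, 3), (1, 2, 3), (4, 6)]  # splits 13|2456, 123|456, 1235|46
T3 = [(1, 3), (1, 2, 3), (5, 6)]  # splits 13|2456, 123|456, 1234|56

rng = random.Random(0)
a, b, c = (random_params(t, rng) for t in (T1, T2, T3))
pi = Fraction(3, 10)

f_mixture = invariant(lambda g: pi * fourier_coordinate(a, g) + (1 - pi) * fourier_coordinate(b, g))
f_single = invariant(lambda g: fourier_coordinate(c, g))
print(f_mixture == 0, f_single != 0)
```

A single random trial is only a consistency check. The vanishing on the mixture holds identically because the pendant products agree across the four monomials and the $123|456$ parameter takes equal values at $G$ and $T$.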

Finally, we pull together all of the results in this section in 
the proof of the main Theorem on tree identifiability: 

Proof: [Proof of Theorem [10]] If the trees relate only 4
taxa, Corollary [17] provides the claim. By Theorem [18] it is
now enough to consider cases with $n = 5, 6$.

If $\{T_1, T_2\}$ and $\{T_3, T_4\}$ are not quartet-matched, then there
is a quartet $Q$ of taxa such that $\{T_1|_Q, T_2|_Q\} \neq \{T_3|_Q, T_4|_Q\}$.
Thus by Corollary [17] and Lemma [3] the claim follows.

If $\{T_1, T_2\}$ and $\{T_3, T_4\}$ are quartet-matched, then by
Proposition [19], up to symmetry, we need only consider the
case described by equations (5). But then Lemma [20] implies

$$\mathcal{I}(V_{T_3} * V_{T_4}) \setminus \mathcal{I}(V_{T_1} * V_{T_2})$$

contains a linear invariant, so

$$V_{T_1} * V_{T_2} \not\subseteq V_{T_3} * V_{T_4}. \qquad ■$$



V. COMPARING 2-TREE MIXTURES WITH UNMIXED MODELS

In this section, we report on preliminary investigations on 
distinguishing unmixed models from 2-tree mixtures. More 
precisely, we study the following question: For which triples
of trees $T_1, T_2, T_3$ is $V_{T_3} \not\subseteq V_{T_1} * V_{T_2}$? We have already used
instances of this in establishing generic identifiability of trees
in the 2-tree mixture model, but our earlier work does not
yield a general answer.

That we can distinguish a single-class, unmixed model $V_{T_3}$
from a 2-tree mixture model $V_{T_1} * V_{T_2}$, as long as $T_3$ is not too
closely related to $T_1$ and $T_2$, is easily shown, however. Indeed,
Lemma [16] and a variant of Lemma [3] imply:

Proposition 21: If there is a four-element set $Q \subseteq [n]$ such
that $T_3|_Q \notin \{T_1|_Q, T_2|_Q\}$ then $V_{T_3} \not\subseteq V_{T_1} * V_{T_2}$.

The smallest instance of a tree $T_3$ all of whose quartet trees
arise from $T_1$ and $T_2$ occurs with $n = 5$ leaves for the triple:

$$\begin{aligned}
T_1 &= \{12|345,\ 123|45\}, \\
T_2 &= \{13|245,\ 134|25\}, \\
T_3 &= \{123|45,\ 13|245\}.
\end{aligned} \tag{6}$$

In fact, this example is unique up to the action of $\mathfrak{S}_5$ on leaf
labels.

We performed a computation using the computer algebra
program Singular [16], which rigorously verified the following:

Theorem 22: For the three 5-taxon trees $T_1, T_2, T_3$ in (6),
under the JC model,

$$V_{T_3} \subseteq V_{T_1} * V_{T_2}.$$

Proof: We explain the approach behind our computation.

All of the JC varieties $V_T$ for an $n$-leaf tree are invariant
under an action of the torus $(\mathbb{C}^*)^n$. This action arises from
rescaling the pendant edge parameters. That is, if $q \in V_T$
and $\lambda \in (\mathbb{C}^*)^n$, then for any subforest $F$ of $T$ the Fourier
coordinate $q_F$ is transformed as $(\lambda \cdot q)_F = q_F \prod_{e \in L(F)} \lambda_e$,
where $L(F)$ is the set of pendant edges appearing in $F$. Since
$\lambda \cdot q \in V_T$, it suffices to prove the claimed containment in the
theorem in the case where all pendant edge parameters on the
tree $T_3$ are set to 1.

Let $V$ and $W$ be two varieties. Note that $V \subseteq W$ is
equivalent to $\mathcal{I}(W) \subseteq \mathcal{I}(V)$. This containment of ideals holds
if, and only if, $\mathcal{I}(W) + \mathcal{I}(V) = \mathcal{I}(V)$, which we use to speed
up computations. Hence, it suffices to show that

$$\mathcal{I}(V_{T_1} * V_{T_2}) + \mathcal{I}(V) = \mathcal{I}(V) \tag{7}$$

where $V$ is the subvariety of $V_{T_3}$ where all the pendant edge
parameters are set to 1.

Finally, though in principle it is possible to compute $\mathcal{I}(V_{T_1} * V_{T_2})$
directly, it is beyond current capabilities. However, an
alternative approach to join ideals uses elimination: if $I, J \subseteq
\mathbb{C}[q]$ are two ideals, their join ideal is

$$I * J = \bigl(I(q') + J(q - q')\bigr) \cap \mathbb{C}[q]$$

where $I(q')$ is the ideal $I$ with variables $q_i'$ substituted for
variables $q_i$, and $J(q - q')$ is the ideal $J$ with $q_i - q_i'$ substituted
for $q_i$. Hence, we can test (7) by testing if

$$\mathcal{I}(V) = \bigl(\mathcal{I}(V_{T_1})(q') + \mathcal{I}(V_{T_2})(q - q') + \mathcal{I}(V)(q)\bigr) \cap \mathbb{C}[q].$$

This statement is verified by the code we provide in the
supplementary materials [3]. ■
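
To make the elimination formula for join ideals concrete, here is a toy worked example. It is an illustration chosen here, not one of the paper's computations: the two ideals are those of the first and second coordinate axes in $\mathbb{C}^3$, whose join is the plane they span.

```latex
% Toy example (assumed for illustration): I = <q_2, q_3>, J = <q_1, q_3> in C[q_1, q_2, q_3].
\begin{align*}
I(q') &= \langle q_2',\ q_3' \rangle, \qquad
J(q - q') = \langle q_1 - q_1',\ q_3 - q_3' \rangle, \\
I(q') + J(q - q') &= \langle q_2',\ q_3',\ q_1 - q_1',\ q_3 - q_3' \rangle
                   = \langle q_2',\ q_3',\ q_1' - q_1,\ q_3 \rangle, \\
I * J &= \bigl(I(q') + J(q - q')\bigr) \cap \mathbb{C}[q] = \langle q_3 \rangle .
\end{align*}
```

Every generator involving a primed variable can be used to eliminate that variable, leaving only $q_3$, so the join is the plane $q_3 = 0$ as expected.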

Theorem [22] raises as many questions as it answers. First,
note that it is a statement about complex varieties, and leaves
open the possibility that $\mathcal{M}_{T_3} \not\subseteq \mathcal{M}_{T_1} * \mathcal{M}_{T_2}$. We investigated
this computationally as follows, using code available in [3].
Choosing random JC parameters on $T_3$, we repeatedly produced
a point in $V_{T_3} \subseteq V_{T_1} * V_{T_2}$, thus obtaining a sample
with high probability of exhibiting generic behavior. For each
such point, we then produced a system of algebraic equations
whose solutions would give mixture parameters on $T_1$ and
$T_2$ to produce this point. The solution set forms an algebraic
variety, which in our trials was always of dimension 2. A
primary decomposition of the ideal showed there were three
components of the solution set, two of dimension 2 and one
of dimension 1.

One of the 2-dimensional components was defined in part
by setting one internal edge length on $T_1$ to infinity and one
internal edge length on $T_2$ to 0. The mixing parameter, all
split parameters in $T_2$, and all but 4 split parameters in $T_1$
were uniquely determined. Two quadratic relationships in 2
variables each held for the remaining parameters. The other
2-dimensional component is similar, with the roles of $T_1$ and
$T_2$ reversed. The 1-dimensional component requires that an
internal edge on each tree have length 0, but allows the mixing
parameter to vary along with two edges on each tree. (See [3]
for the precise results.)

It is worth highlighting that the only 2-tree mixtures matching
the 1-tree distribution were of this extreme nature, with
some internal edges of length 0 or infinity. If one allows
these values, then there are instances of all mixture parameters
being in a stochastically meaningful range. Of course formally
establishing any conjectures these calculations suggest would
require a detailed semi-algebraic analysis of these models.

A second question Theorem [22] might lead one to ask is:
if $T_3$ is a tree all of whose quartets come from either $T_1$
or $T_2$, then is $V_{T_3} \subseteq V_{T_1} * V_{T_2}$? However, we have already
seen an instance where this failed in Lemma [20]. It would
be interesting to characterize precisely when these types of
containments arise.

Finally, it is not at all clear if the containment in Theorem
[22] is a special phenomenon for the JC model, or if it can
occur more generally for other group-based or more general
phylogenetic models. Answering such questions will require
an understanding of this phenomenon beyond the computational
perspective.

VI. IDENTIFIABILITY OF CONTINUOUS PARAMETERS 

Assuming tree topologies are already known, we next ex- 
plore the generic identifiability of the continuous parameters in 
group-based mixture models. We use both rigorous arguments 
and computational approaches to address this issue. While 
standard laptop computers were sufficient for most of this work 
(see [3]), the more intensive computations were performed 
on a more powerful machine provided by Erich Kaltofen of 
NCSU. 

Proving a model has identifiable continuous parameters 
requires showing that the parameterization map is one-to- 
one. Without any special assumptions on the map, it may 
be one-to-one on some region of parameter space, but not 
on another. Well-known to algebraic geometers, however, is 
that parameterization maps defined by polynomial formulas, 
such as those for the models we study, have the nice feature 
that they exhibit a generic behavior. More specifically, there is
some $k \in \{1, 2, 3, \ldots, \infty\}$ such that for all parameter values
except those in some exceptional set $E$, the map will be $k$-to-one
(cf. Prop. 7.16 of [17]). Crucially, the set $E$ where the
generic behavior may fail is closed and of Lebesgue measure 0
within the full parameter space (since it is a proper algebraic
subvariety). In the case of complex univariate polynomials, this
fact is more widely familiar: given an $n$-th degree polynomial
$p(z)$, for almost all $a \in \mathbb{C}$ the equation $p(z) = a$ has $n$ distinct
roots. However, for a finite number of exceptional values of $a$
there may be fewer distinct roots. Thus $p$ defines a generically
$n$-to-one map from $\mathbb{C}$ to $\mathbb{C}$.
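
The univariate case can be seen in a small example. The sketch below uses an illustrative cubic chosen here, $p(z) = z^3 - 3z$, and the standard discriminant $-4P^3 - 27Q^2$ of a depressed cubic $z^3 + Pz + Q$ to count the distinct complex roots of $p(z) = a$. The function name is hypothetical.

```python
from fractions import Fraction


def distinct_root_count(a):
    """Number of distinct complex roots of z**3 - 3*z - a, via the cubic discriminant."""
    p, q = Fraction(-3), -Fraction(a)
    disc = -4 * p**3 - 27 * q**2  # equals 108 - 27*a**2 for this cubic
    if disc != 0:
        return 3
    # disc == 0 with p != 0 means one double root and one simple root.
    return 2


counts = {a: distinct_root_count(a) for a in range(-5, 6)}
exceptional = [a for a, k in counts.items() if k != 3]
print(exceptional)
```

Among the integers tested, only $a = \pm 2$ are exceptional (there $z^3 - 3z \mp 2$ has a double root), so the map is generically three-to-one with a finite exceptional set.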

One can computationally determine the generic behavior 
with high probability as follows: For a specific choice of 
parameters, calculate the cardinality of the set of all other 
choices of parameters with the same image. If, for many 
such random choices, one finds this fiber is of size $k$, there
can be little doubt that the map is generically $k$-to-one. These
computations can be performed exactly by computational
algebra software such as Macaulay2 [15] or Singular [16],
and carefully performed repeated trials can give one high 
confidence. Of course, such an approach does not rigorously 
establish results. However, the use of random data to reliably 
study behavior of specific polynomial equations is not novel. 
For instance, Section 6 of [21] gives a different application of 
the idea in phylogenetics. 

This approach unfortunately does not give any quantifiable 
meaning to the term 'high probability,' as we lack any explicit 
information on the set E where non-generic behavior may 
arise. If a non-zero multivariate polynomial vanishing on $E$
were known, we would only need to compute that the map
was $k$-to-one for a single point not satisfying that polynomial,
and obtain a rigorous result. If we knew only the degree
of such a polynomial, by the Schwartz-Zippel Theorem (cf.,
for instance, [26]), we could produce points with arbitrarily
small probability of lying in $E$, and use these to quantify our
terminology. However, we have no such information, and thus
our confidence in having determined the generic behavior is 
based partly on experience. In choosing points for calculations, 
a useful heuristic is to pick coordinates to be random rational 
numbers (perhaps also requiring that they be expressible using 
disjoint sets of primes), in hopes that the unknown polynomial 
equations describing E are less likely to be satisfied. Indeed, 
if 25 points chosen in this way all produce the same value of 
k, while it is possible they all lie in E, the evidence is strong 
that they do not. 
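
The Schwartz-Zippel bound mentioned above says that a non-zero polynomial of total degree $d$ vanishes at a point with coordinates drawn uniformly from a finite set $S$ with probability at most $d/|S|$. The sketch below illustrates this on a small hypothetical polynomial `g`, which merely stands in for the unknown polynomial vanishing on $E$; nothing here is taken from the paper's computations.

```python
from fractions import Fraction
from itertools import product


def schwartz_zippel_bound(degree, sample_set_size):
    """Upper bound d/|S| on the probability that a nonzero polynomial of total
    degree d vanishes at a point with coordinates drawn uniformly from S."""
    return Fraction(degree, sample_set_size)


def g(x, y):
    """A hypothetical degree-2 polynomial standing in for one that vanishes on E."""
    return (x - y) * (x + y - 3)


S = range(10)
zeros = sum(1 for x, y in product(S, repeat=2) if g(x, y) == 0)
observed = Fraction(zeros, len(S) ** 2)   # exact fraction of grid points on the zero set
bound = schwartz_zippel_bound(2, len(S))
print(observed, bound, observed <= bound)
```

Enlarging the sample set shrinks the bound, which is how a known degree would let one choose points with an arbitrarily small chance of landing in the exceptional set.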

We label statements with "Theorem*" or "Proposition*" if 
we are only highly confident of them through such computa- 
tion. Unstarred statements are rigorously proved. Thus while 
we are careful to distinguish between results with rigorous 
proof and those depending on such calculations, we are highly 
confident of both. 

One of the results we found computationally was a par- 
ticularly surprising non-identifiability result for continuous 
parameters of 4-taxon tree mixtures under the JC model. 
Nonetheless, passing to 5-taxon trees restores identifiability. 

The first main result in this section is: 

Theorem* 23: For the JC model, the continuous parameters
in the 2-tree mixture with parameterization $\psi_{T_1,T_2}$ are generically
identifiable for binary trees with $n \geq 4$ leaves, except
in the case that $n = 4$ and $T_1 = T_2$.

An issue that will arise in our proof of Theorem* [23]
and related results concerns the maps $\psi_T$ parametrizing $V_T$
for the JC, K2P, and K3P models. The proof in [9] of the
identifiability of numerical parameters for the general Markov
model shows that identifiability of numerical parameters only
holds up to permutation of states at internal nodes of the
tree. Permuting the states at an internal node corresponds to
permuting rows of transition matrices on edges leading out
of the node, and columns of matrices on edges leading into
the node. As any permutation of the rows or columns of a
JC matrix that is also a JC matrix is identical to the original
matrix, this implies the JC parameterization is generically one-to-one.
For a generic K2P matrix, there are two orderings of
the rows that have K2P form, and hence the parametrization
map is generically $2^{n-2}$-to-one. For a generic K3P matrix,
there are four orderings of the rows that have K3P form, and
hence the parametrization map is generically $4^{n-2}$-to-one. To
avoid complications in statements due to these understood
failures of identifiability in its strictest sense, it is more
convenient to focus on the $k$-to-oneness of the maps $\phi_{T_1,T_2}$,
using equation (1) to relate results to $\psi_{T_1,T_2}$.
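
The counts of admissible row orderings (one for JC, two for K2P, four for K3P) can be confirmed by brute force over all 24 row permutations. The sketch below is an illustration under assumptions made here: states are encoded as 0 to 3 with the group law given by XOR, a matrix is group-based when its $(i,j)$ entry depends only on `i ^ j`, and the K2P constraint is taken to be equality of the entries at group elements 2 and 3. Function names and the sample entries are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def group_based_matrix(f):
    """4x4 matrix with entries f[i ^ j]; states 0..3 encode Z2 x Z2 under XOR."""
    return [[f[i ^ j] for j in range(4)] for i in range(4)]


def has_form(m, model):
    """Check that m is group-based and satisfies the equalities imposed by the model."""
    f = m[0]
    if any(m[i][j] != f[i ^ j] for i in range(4) for j in range(4)):
        return False
    if model == "K3P":
        return True
    if model == "K2P":
        return f[2] == f[3]
    if model == "JC":
        return f[1] == f[2] == f[3]
    raise ValueError(model)


def orderings_with_form(f, model):
    """Count row orderings of the matrix built from f that still have the model's form."""
    m = group_based_matrix(f)
    return sum(has_form([m[s] for s in sigma], model) for sigma in permutations(range(4)))


# Sample entries with no accidental coincidences (values chosen here for illustration).
jc = [Fraction(7, 10), Fraction(1, 10), Fraction(1, 10), Fraction(1, 10)]
k2p = [Fraction(6, 10), Fraction(2, 10), Fraction(1, 10), Fraction(1, 10)]
k3p = [Fraction(1, 2), Fraction(1, 4), Fraction(3, 20), Fraction(1, 10)]

counts = {
    "JC": orderings_with_form(jc, "JC"),
    "K2P": orderings_with_form(k2p, "K2P"),
    "K3P": orderings_with_form(k3p, "K3P"),
}
n = 5  # a binary tree with n leaves has n - 2 internal nodes
fiber_sizes = {model: count ** (n - 2) for model, count in counts.items()}
print(counts, fiber_sizes)
```

Raising each count to the power $n - 2$, one factor per internal node, gives the generic fiber sizes quoted in the text.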

The first step toward Theorem* [23] is performing computa- 
tions to establish the following. 

Proposition* 24: Let $T_1 \neq T_2$ be binary trees with four
leaves. Then for the JC model, the map

$$\phi_{T_1,T_2} : V_{T_1} \times V_{T_2} \times \mathbb{P}^1 \to V_{T_1} * V_{T_2}$$

is generically one-to-one.

Proof: [Calculation] From randomly chosen rational parameters
in the domain of $\psi_{T_1,T_2}$, we computed a point
$p \in V_{T_1} * V_{T_2}$. Solving the system of polynomial equations
$\psi_{T_1,T_2}(s_1, s_2, \pi) = p$ determines the (complex) preimage of $p$.
This preimage can be calculated using Gröbner bases, and was
found to consist of a single point for the many such random
choices we made. We can therefore be highly confident that
$\psi_{T_1,T_2}$ is generically one-to-one, by the existence of a generic
behavior of any polynomial map. That $\phi_{T_1,T_2}$ is one-to-one then follows
from the fact that $\psi_{T_1}$ and $\psi_{T_2}$ are generically one-to-one
parameterizations of $V_{T_1}$ and $V_{T_2}$.

Code is provided in the supplementary materials [3]. ■ 

Although we attempted to perform similar calculations to
extend Proposition* [24] to the K2P and K3P models, these
failed to terminate within 3 weeks.

In the case of a mixture on two trees with the same topology, 
the possibility of interchanging the mixture components shows 
the map cannot be one-to-one. Generic identifiability thus 
corresponds to generic two-to-oneness in this case. For this 
type of mixture, we are able to perform computations for both 
the JC and Kimura models. 

Proposition* 25: Let $T$ be a binary tree with four leaves.
Then for the K2P and K3P models, the map

$$\phi_{T,T} : V_T \times V_T \times \mathbb{P}^1 \to V_T * V_T$$

is generically two-to-one. For the JC model, the map $\phi_{T,T}$ is
generically twelve-to-one.

Proof: [Calculation] The calculations which indicate this
holds with high probability are similar to those for Proposition*
[24].

Code is provided in the supplementary materials [3]. ■
Note that the twelve-to-oneness in the case of the JC model 
is not merely a mathematical anomaly relevant to complex pa- 
rameter choices only. This type of non-identifiability for secant 
parameters can and does occur for stochastically meaningful 
parameters. 

Example 26: Searches of parameter space give instances of
2, 4 or 8 stochastic parameter choices producing the same
image in the 4-taxon 2-class JC mixture on a single tree
topology. For instance, if $a^e = a^e_C = a^e_G = a^e_T$ denotes the JC
parameters for one class on $T$, and $b^e = b^e_C = b^e_G = b^e_T$ those
for the other class, and $\pi$ the proportion of the first class, then

$$\begin{aligned}
&\pi = 0.1, \\
&a^1 = 0.05,\ a^2 = 0.10,\ a^3 = 0.12,\ a^4 = 0.04,\ a^{12|34} = 0.01, \\
&b^1 = 0.04,\ b^2 = 0.14,\ b^3 = 0.10,\ b^4 = 0.11,\ b^{12|34} = 0.46,
\end{aligned}$$

and 7 other choices of parameters have the same image.
Up to interchanging classes, there are 4 essentially different
choices. Code verifying this example, and examples showing
2 or 4 biologically relevant preimages, are included in the
supplementary materials [3]. We do not know if exactly 6, 10,
or 12 biologically relevant preimages can occur.

We rigorously establish the following: 

Proposition 27: Let $T$ be a binary tree with five leaves.
Then for the JC, K2P, and K3P models the map

$$\phi_{T,T} : V_T \times V_T \times \mathbb{P}^1 \to V_T * V_T$$

is generically two-to-one.

The proof of Proposition [27] depends on a result of
J. Kruskal concerning uniqueness of rank 1 tensor decompositions
for 3-way arrays. As this has been exploited elsewhere
[2], [7] to study identifiability of models, we give only
essentials here. If $M_1, M_2, M_3$ are three matrices with $r$ rows,
and $\boldsymbol{\pi}$ is an $r$-element vector, let $m^j_i$ denote row $i$ of matrix
$M_j$. Let

$$[\boldsymbol{\pi}; M_1, M_2, M_3] = \sum_{i=1}^{r} \pi_i\, m^1_i \otimes m^2_i \otimes m^3_i.$$

The form of Kruskal's theorem most useful for our purposes
is the following, from [2].

Theorem 28: (Kruskal) Let $\boldsymbol{\pi}$ be an $r$-element vector of
non-zero numbers, and $M_1, M_2, M_3$ three matrices with $r$
rows, all of whose row sums are 1. Let $I_i$, the Kruskal rank
of $M_i$, be the largest integer such that every set of $I_i$ rows of
$M_i$ is independent, and suppose

$$I_1 + I_2 + I_3 \geq 2r + 2.$$

Then if $[\boldsymbol{\pi}; M_1, M_2, M_3] = [\boldsymbol{\pi}'; M_1', M_2', M_3']$, there is a
permutation $P$ such that

$$\boldsymbol{\pi} = P\boldsymbol{\pi}', \quad M_1 = PM_1', \quad M_2 = PM_2', \quad M_3 = PM_3'.$$

Proof: [Proof of Proposition [27]] To fix notation, let $T$
have non-trivial splits $\{12|345, 123|45\}$ and let $\rho$ denote the
internal node on the pendant edge leading to leaf 3. Denote by
$\psi$ the natural parameterization of $V_T$, in terms of the entries
of $4 \times 4$ Markov matrices. As there is no advantage to working
in Fourier coordinates here, we use standard ones for $V_T$ and
$V_T * V_T$.

With id the identity map on $\mathbb{P}^1$, it is enough to show the
parameterization map $\phi_{T,T} \circ (\psi \times \psi \times \mathrm{id})$ of $V_T * V_T$ is
generically two-to-one.

The map $\phi_{T,T} \circ (\psi \times \psi \times \mathrm{id})$ can be made explicit as follows:
Root the tree at $\rho$, and assign stochastic matrices to the edges
of the tree giving conditional probabilities of state changes
along those edges. If $\pi$ is the mixing parameter, and $\mathbf{u} =
(1/4, 1/4, 1/4, 1/4)$, then an 8-element vector

$$\boldsymbol{\pi} = (\pi \mathbf{u}, (1 - \pi)\mathbf{u})$$

gives the state distribution at the root. On the pendant edge
leading to leaf $i$, an $8 \times 4$ matrix $M_i$ composed of two stacked
Markov matrices $M_i^{(1)}$ and $M_i^{(2)}$ of the appropriate form gives
conditional probabilities, while on the internal edges there are
$8 \times 8$ block-diagonal matrices $M_{12|345}$ and $M_{123|45}$ with two
$4 \times 4$ blocks $M^{(1)}_{12|345}$, $M^{(2)}_{12|345}$ and $M^{(1)}_{123|45}$, $M^{(2)}_{123|45}$ of the
appropriate form. Thus the superscript (1) or (2) refers to the
class in the mixture.

Now the $8 \times 16$ matrix $M_{12} = M_{12|345}(M_1 \otimes^{\mathrm{row}} M_2)$,
where $\otimes^{\mathrm{row}}$ denotes tensor products of corresponding rows,
gives probabilities of observing pairs of states at leaves 1 and
2 conditioned on the state at $\rho$. A similar $8 \times 16$ matrix product
$M_{45}$ gives probabilities of observing pairs of states at leaves
4 and 5 conditioned on the state at $\rho$.

For any choice of $\pi$ and the $M_e^{(1)}$, $M_e^{(2)}$, the image $X$ under
$\phi_{T,T} \circ (\psi \times \psi \times \mathrm{id})$ has the same entries as the 3-way array
$[\boldsymbol{\pi}; M_{12}, M_3, M_{45}]$. But for generic choices of parameters, one
can check that $M_{12}$ and $M_{45}$ have Kruskal rank 8, and $M_3$
has Kruskal rank $\geq 2$. Indeed, one need only check that this
holds for a single choice of the parameters, since then the
condition, which is defined by polynomial inequalities, can
fail only on a proper subvariety of the parameter space. For
instance, choosing $M_1^{(1)} = M_2^{(1)}$ to be a JC matrix with off-diagonal
entry 0.1, $M_1^{(2)} = M_2^{(2)} = I_4$, and $M_{12|345} = I_8$, a
calculation shows $M_{12}$ has rank 8, and hence Kruskal rank 8.

Applying Theorem [28] we get that $\boldsymbol{\pi}$, $M_{12}$, $M_3$, and $M_{45}$
are all uniquely determined up to simultaneous permutation of
rows.

However, because of the special form of the Markov matrices
for the models we consider, for generic JC parameters
there are exactly two orderings to the rows of $M_3$ so that it
is two stacked blocks of the correct form, and these differ by
simply interchanging the blocks. Thus we may recover $M_3^{(1)}$
and $M_3^{(2)}$, up to order. Fixing the ordering of the rows of $M_3$
so $M_3^{(1)}$ is on top fixes an ordering of the rows of $\boldsymbol{\pi}$, $M_{12}$,
and $M_{45}$ as well. Letting a superscript of 1 denote the top
4 rows, and a superscript of 2 the bottom 4 rows of an 8-row
matrix, the mixture distribution can be written as

$$\pi\,[\mathbf{u}; M_{12}^{1}, M_3^{1}, M_{45}^{1}] + (1 - \pi)\,[\mathbf{u}; M_{12}^{2}, M_3^{2}, M_{45}^{2}].$$

But this weighted sum is simply the weighted sum of the two
points in the image of $\psi$ corresponding to the two classes. As
$\psi$ is known to be generically one-to-one, the parameterization
of $V_T * V_T$ is two-to-one in the JC case.

As discussed following the statement of Theorem* [23], for
the K2P and K3P models there are additional orderings of the
rows of $M_3$ so that it is two stacked blocks of the correct
form. Regardless of which ordering we choose, however, by
arguing as in the preceding paragraph we are led to the same
two points in the image of $\psi$. Thus for these models also we
see the parameterization of $V_T * V_T$ by $\phi_{T,T}$ is two-to-one. ■
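
The rank check used in this proof is easy to reproduce with exact arithmetic. The sketch below is illustrative and not the authors' code. It assumes the reading of the example in which the class-1 blocks on the pendant edges to leaves 1 and 2 are a JC matrix with off-diagonal entry 0.1, the class-2 blocks are $I_4$, and the internal-edge matrix is $I_8$. The choice of `M3` and all helper names are assumptions made here.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Exact rank of a matrix (list of rows of Fractions) by Gaussian elimination."""
    m = [list(r) for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(rk + 1, len(m)):
            factor = m[i][col] / m[rk][col]
            if factor != 0:
                m[i] = [x - factor * y for x, y in zip(m[i], m[rk])]
        rk += 1
    return rk


def kruskal_rank(rows):
    """Largest k such that every set of k rows is linearly independent."""
    k = 0
    for size in range(1, len(rows) + 1):
        subsets = combinations(range(len(rows)), size)
        if all(rank([rows[i] for i in idx]) == size for idx in subsets):
            k = size
        else:
            break
    return k


def row_tensor(m1, m2):
    """Row-wise tensor product: row i is the Kronecker product of row i of m1 and of m2."""
    return [[x * y for x in r1 for y in r2] for r1, r2 in zip(m1, m2)]


off = Fraction(1, 10)
jc = [[Fraction(7, 10) if i == j else off for j in range(4)] for i in range(4)]
eye = [[Fraction(int(i == j)) for j in range(4)] for i in range(4)]

M1 = jc + eye              # class-1 block stacked on the class-2 block
M2 = jc + eye
M12 = row_tensor(M1, M2)   # the internal-edge matrix is taken to be I_8
M3 = jc + eye              # an illustrative choice for the edge to leaf 3

r = 8
k12, k3 = kruskal_rank(M12), kruskal_rank(M3)
condition_holds = k12 + k3 + k12 >= 2 * r + 2  # M45 built the same way as M12
print(rank(M12), k12, k3, condition_holds)
```

Since a matrix with 8 independent rows automatically has Kruskal rank 8, the full subset search is needed only for `M3`; it is applied to both here for uniformity.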

Note that the use of Kruskal's theorem in this proof extends 
to a 2-class CFN mixture model on a 5-taxon tree, as then the 
Kruskal ranks of the matrices M12 and M45 are generically 4, 
while M2 has Kruskal rank > 2 . Although we do not focus 
on that model here, we record the result, as it is helps place 
the examples of [24] for 4-taxon trees into context. 

Proposition 29: Let T be a binary tree with five leaves. 
Then for the CFN models the map 



&t t '■ Vr x Vt x 



-4 Vt * Vt 



is generically two-to-one. 
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Proposition* 30: For the JC 2-tree mixture model on 5- 
taxon binary trees, the continuous parameters are generically 
identifiable. 

Proof: If $T_1$ and $T_2$ have no cherries in common, 
then all their induced quartet trees disagree. Thus applying 
Proposition 24 to all 4-taxon marginalizations shows all 
parameters are generically identifiable. 

If $T_1$ and $T_2$ have 2 cherries in common, they are identical, 
and Proposition 27 gives the claim. 

If $T_1$ and $T_2$ have a single cherry in common, we 
may assume they are $T_1 = \{12|345, 123|45\}$ and $T_2 = 
\{12|345, 124|35\}$. Also, since the parameters are generic, we 
may assume the mixing parameter giving the class size for 
the $T_1$ component is $\pi \neq 1/2$. Then marginalizing to the 
taxa $\{1,3,4,5\}$ and applying Proposition 24 identifies the 
parameters on 4 edges of each of the trees, as well as the 
class size $\pi$ for $T_1$. 
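
The marginalization step can be checked mechanically. The Python sketch below is an illustration and not part of the original argument: it encodes each 5-taxon tree by one side of each of its two nontrivial splits and reads off the quartet tree induced on a 4-taxon subset. The function name `induced_quartet` and this encoding are assumptions made here.

```python
from itertools import combinations


def induced_quartet(split_sides, quartet):
    """Return the 2|2 split induced on `quartet`, or None if it is unresolved.

    `split_sides` lists one side of each nontrivial split of the tree;
    the other side is the complement within the full taxon set.
    """
    q = frozenset(quartet)
    for side in split_sides:
        a = frozenset(side) & q
        b = q - a
        if len(a) == 2 and len(b) == 2:
            return frozenset({a, b})
    return None


# T_1 = {12|345, 123|45} and T_2 = {12|345, 124|35},
# each split stored by its two-element side.
T1 = [{1, 2}, {4, 5}]
T2 = [{1, 2}, {3, 5}]

# Quartets on which the two trees induce different topologies.
differing = [
    q for q in combinations(range(1, 6), 4)
    if induced_quartet(T1, q) != induced_quartet(T2, q)
]
```

Under this encoding the three quartets containing both taxa 1 and 2 carry the same induced topology in the two trees, while the subset $\{1,3,4,5\}$ used in the proof is one on which the topologies differ.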

Marginalizing to quartets involving taxa 1 and 2, and 
applying Proposition 25 to them, there are 12 points in a 
generic fiber. However, such a generic fiber will have 6 distinct 
pairs of values $\{\pi, 1 - \pi\}$, and we use the value of the mixing 
parameter $\pi$ determined above to match parameters with $T_1$ 
and $T_2$. ■ 
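
The matching step amounts to filtering the candidate fiber points by the already known class weight. The Python sketch below is a toy illustration under stated assumptions: the candidate list, its weights, and the name `match_by_weight` are all hypothetical, and real fiber points would also carry edge parameters rather than a text label.

```python
def match_by_weight(candidates, pi_known, tol=1e-9):
    """Keep the candidates whose T_1 class weight equals the known value of pi."""
    return [c for c in candidates if abs(c["pi"] - pi_known) <= tol]


def toy_fiber(weights):
    """Build a toy fiber: each weight w appears once as w and once as 1 - w."""
    fiber = []
    for i, w in enumerate(weights):
        fiber.append({"pi": w, "label": f"candidate {i}, classes in given order"})
        fiber.append({"pi": 1 - w, "label": f"candidate {i}, classes swapped"})
    return fiber


# Six made-up weight pairs {w, 1 - w}, giving 12 candidate points.
generic = toy_fiber([0.3, 0.1, 0.2, 0.25, 0.4, 0.45])
matched = match_by_weight(generic, pi_known=0.3)

# If the weight were exactly 1/2, both orderings would match and the
# assignment of parameters to T_1 and T_2 would remain ambiguous.
degenerate = toy_fiber([0.5, 0.1, 0.2, 0.25, 0.4, 0.45])
ambiguous = match_by_weight(degenerate, pi_known=0.5)
```

In the toy data a single candidate survives when the known weight differs from 1/2, whereas two survive at exactly 1/2, which is why the genericity assumption $\pi \neq 1/2$ is used.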

The results above allow us to argue for the generic 
identifiability of parameters claimed in Theorem 23. 

Proof (of Theorem 23): The $n = 4$ case is 
Proposition 24 and the $n = 5$ case is Proposition 30. 

For $n > 5$ leaves, by assuming that the parameters are 
generic we may also suppose the mixing parameter $\pi \neq 1/2$. 

By marginalizing to 5-taxon subsets, and applying 
Proposition 30, we may identify parameters on each pair of induced 
5-taxon trees, but we must determine which come from which 
tree. If there is at least one 5-taxon subset for which $T_1$ 
and $T_2$ induce different subtrees, then we know the class 
size parameter $\pi$ for $T_1$. Using this known value, we can 
determine which induced 5-taxon parameters arise from $T_1$ 
and which arise from $T_2$, even when the 5-taxon subtrees are 
topologically the same. If all 5-taxon subtrees of $T_1$ and $T_2$ 
agree, so $T_1 = T_2$, then we instead use the value of $\pi$ to 
collect 5-taxon subtree parameters from each copy of the tree. 
As the parameters for $T_1$ and $T_2$ are elements of the collection 
of induced parameters, we thus identify all parameters on the 
full trees. ■ 

In closing, note that the arguments in the proof of 
Theorem 23, in combination with the results of Proposition 27, 
rigorously prove the following result, in the case of identical 
tree topologies. 

Theorem 31: For the JC, K2P, and K3P models, the 
continuous parameters in the 2-tree mixture on the same tree 
topology are generically identifiable for binary trees with 
$n \geq 5$ leaves. 